A short description of the post.
This report details the preliminary visualization preparation for a project undertaken for the course ISSS608 - Visual Analytics and Applications offered in SMU MITB. As one of the project deliverables, this assiganment is a sub-module (module 1) from our proposed project - Enabling optimization of Bike Sharing Operations – Bluebikes.
Based on Shayini, S. (2017, March 24). RPubs - Bike Sharing Data Analysis with R. RPubs. https://www.rpubs.com/shayini/bike_sharing, the analysis digs into data exploration aspect, we include geospatial data mapping in our project to enrich the data visualization. According to Low, C. T. (2018, September 5). Analysis and Visualization of Blue Bikes Sharing in Boston. DataScience+. https://datascienceplus.com/blue-bikes-sharing-in-boston/, the analysis and visualization were performed with mapping using leaflet. Specifically, dock stations are marked in map as well as trips in each dock station are present through heatmap. For dock stations distribution, popups function are used to show the name of the dock station by clicking, to enhance the user experience, we use the htmltool package to enable the label display by hovering house over a marker. As regard to the heatmap, the overall popularity of each dock station can be acquired, however we cannot gain the detailed trip number or station information from it. To narrow this gap, we intend to use dot density to represent the popularity of each dock station, and increase the interactivity by using cluster, on top of that, we add filters to allow us to investigate the usage of different type of users along one day.
Packages required
library(data.table) # for faster loading on large dataset
library(tidyverse) # load ggplot, dplyr
library(leaflet) # interactive mapping
library(leaflet.extras) # extra mapping for leaflet
library(lubridate) # formatting date and time
library(gridExtra) # multiple plot arrange
library(grid) # multiple plot arrange
library(psych) # to get describe function
library(janitor) # to get tabulate variables function
library(htmltool) # for HTML generation and output
library(crosstalk) # for filter inputs
packages = c('data.table','tidyverse','leaflet','leaflet.extras','lubridate','gridExtra','grid','psych','janitor', 'htmltools','crosstalk')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only= T )
}
#load trip data
trip <- read_csv('data/202001-bluebikes-tripdata.csv')
#load station data
station <- read.csv('data/current_bluebikes_stations.csv')
glimpse(trip)
Rows: 128,598
Columns: 15
$ tripduration <dbl> 478, 363, 284, 193, 428, 695, 1...
$ starttime <dttm> 2020-01-01 00:04:05, 2020-01-0...
$ stoptime <dttm> 2020-01-01 00:12:04, 2020-01-0...
$ `start station id` <dbl> 366, 219, 219, 396, 60, 372, 36...
$ `start station name` <chr> "Broadway T Stop", "Boston East...
$ `start station latitude` <dbl> 42.34278, 42.37331, 42.37331, 4...
$ `start station longitude` <dbl> -71.05747, -71.04102, -71.04102...
$ `end station id` <dbl> 93, 212, 212, 387, 49, 178, 23,...
$ `end station name` <chr> "JFK/UMass T Stop", "Maverick S...
$ `end station latitude` <dbl> 42.32034, 42.36884, 42.36884, 4...
$ `end station longitude` <dbl> -71.05118, -71.03978, -71.03978...
$ bikeid <dbl> 6005, 3168, 3985, 2692, 4978, 5...
$ usertype <chr> "Customer", "Subscriber", "Subs...
$ `birth year` <dbl> 1969, 2000, 2001, 1978, 1987, 1...
$ gender <dbl> 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0...
# Convert second to minutes and round it non decimal points
trip$minutes <- round(trip$tripduration/60, 0)
# Calculate the age from birth year
trip$age <- 2020 - trip$'birth year'
# Recode gender variables
trip$gender <- as.factor(recode(trip$gender, '0' = 'Female', '1' = 'Male', '2' = 'Prefer not to say'))
# Convert the time and date variable to correct format
trip$starttime <- ymd_hms(trip$starttime)
trip$stoptime <- ymd_hms(trip$stoptime)
# Separate date and time
trip <- trip %>%
mutate_at(vars(starttime), ymd_hms) %>%
mutate_at(vars(starttime), funs('start_date' = date(.)))
trip <- trip %>%
mutate_at(vars(stoptime), ymd_hms) %>%
mutate_at(vars(stoptime), funs('stop_date' = date(.)))
# Extracting day, weekday and hour
trip$day <- day(trip$starttime)
trip$weekday <- wday(trip$starttime, label = TRUE)
trip$hour <- hour(trip$starttime)
# Select relevant variable
trip_data <- select(trip, 'tripduration', 'start_date', 'stop_date', 'start station name', 'start station latitude', 'start station longitude', 'end station name' , 'end station latitude', 'end station longitude', 'usertype', 'gender', 'minutes', 'age', 'day', 'weekday', 'hour')
# Rename variable
colnames(trip_data) <- c('tripduration', 'start_date', 'stop_date', 'start_name', 'start_lat', 'start_long', 'end_name' ,'end_lat', 'end_long', 'usertype', 'gender', 'minutes', 'age', 'day', 'weekday', 'hour')
# Convert into factor
trip_data$weekday <- as.factor(trip_data$weekday)
glimpse(trip_data)
Rows: 128,598
Columns: 16
$ tripduration <dbl> 478, 363, 284, 193, 428, 695, 1336, 1329, 12...
$ start_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01...
$ stop_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01...
$ start_name <chr> "Broadway T Stop", "Boston East - 126 Border...
$ start_lat <dbl> 42.34278, 42.37331, 42.37331, 42.40933, 42.3...
$ start_long <dbl> -71.05747, -71.04102, -71.04102, -71.06382, ...
$ end_name <chr> "JFK/UMass T Stop", "Maverick Square - Lewis...
$ end_lat <dbl> 42.32034, 42.36884, 42.36884, 42.40986, 42.3...
$ end_long <dbl> -71.05118, -71.03978, -71.03978, -71.06632, ...
$ usertype <chr> "Customer", "Subscriber", "Subscriber", "Sub...
$ gender <fct> Female, Male, Male, Male, Male, Male, Female...
$ minutes <dbl> 8, 6, 5, 3, 7, 12, 22, 22, 22, 21, 21, 20, 1...
$ age <dbl> 51, 20, 19, 42, 33, 31, 51, 51, 51, 51, 51, ...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ weekday <ord> Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed,...
$ hour <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Before the analysis, we first visualize all the dock stations around Boston, and the dock station name will be displayed while hovering over the mouse.
DockStation <- station %>%
leaflet() %>%
setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>%
addTiles() %>%
addMarkers(lng = station$Longitude,
lat = station$Latitude,
label = ~htmlEscape(station$Name))
DockStation
The overall popularity of each dock station.
#starting location
#mapping
trip_data %>%
leaflet() %>%
setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>%
addTiles() %>%
addHeatmap(lng = trip_data$start_long,
lat = trip_data$start_lat,
max = 2,
radius = 15)
Since there are a great number of stations on the map, to avoid the large overlap of dots, we cluster them for better visualize the popularity of each dock station as well as the distribution of dock stations. On top of that, we use both dot size and dot color to represent the number of trip for each station. Right click the circle to zoom in and acquire the dock station name and number of trip.
# Make mapping color palette for numerical variables
popularity<-tabyl(trip_data, start_name)
cPal<-colorNumeric(palette = "YlOrRd", domain = popularity$n)
#starting location
#mapping
trip_data %>%
leaflet() %>%
setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>%
addTiles() %>%
addCircleMarkers(clusterOptions = markerClusterOptions(),
lng = trip_data$start_long,
lat = trip_data$start_lat,
color = cPal(popularity$n),
stroke = FALSE,
radius = sqrt(popularity$n)/5,
fillOpacity = 1,
popup = paste('Station:',trip_data$start_name, '<br>',
'Number of trips:',popularity$n))